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Sequence analyses of RNA virus genomes remain challenging owing to the exceptional genetic plasticity of these viruses. 
Because of high mutation and recombination rates, genome replication by viral RNA-dependent RNA polymerases leads 
to populations of closely related viruses, so-called “quasispecies.” Standard (short-read) sequencing technologies are ill-suit- 
ed to reconstruct large numbers of full-length haplotypes of (I) RNA virus genomes and (2) subgenome-length (sg) RNAs 
composed of noncontiguous genome regions. Here, we used a full-length, direct RNA sequencing (DRS) approach based on 
nanopores to characterize viral RNAs produced in cells infected with a human coronavirus. By using DRS, we were able to 
map the longest (~26-kb) contiguous read to the viral reference genome. By combining Illumina and Oxford Nanopore 
sequencing, we reconstructed a highly accurate consensus sequence of the human coronavirus (HCoV)-229E genome 
(27.3 kb). Furthermore, by using long reads that did not require an assembly step, we were able to identify, in infected cells, 
diverse and novel HCoV-229E sg RNAs that remain to be characterized. Also, the DRS approach, which circumvents reverse 
transcription and amplification of RNA, allowed us to detect methylation sites in viral RNAs. Our work paves the way for 
haplotype-based analyses of viral quasispecies by showing the feasibility of intra-sample haplotype separation. Even though 
several technical challenges remain to be addressed to exploit the potential of the nanopore technology fully, our work 
illustrates that DRS may significantly advance genomic studies of complex virus populations, including predictions on 


long-range interactions in individual full-length viral RNA haplotypes. 


[Supplemental material is available for this article.] 


Coronaviruses (subfamily Coronavirinae, family Coronaviridae, or- 
der Nidovirales) are enveloped positive-sense (+) single-stranded 
(ss) RNA viruses that infect a variety of mammalian and avian hosts 
and are of significant medical and economic importance, as illus- 
trated by recent zoonotic transmissions from diverse animal hosts 
to humans (Vijay and Perlman 2016; Menachery et al. 2017). The 
genome sizes of coronaviruses (~30 kb) exceed those of most other 
RNA viruses. Coronaviruses use a special mechanism called discon- 
tinuous extension of minus strands (Sawicki and Sawicki 1995, 
1998) to produce a nested set of 5’- and 3’-coterminal subgenomic 
(sg) mRNAs that carry acommon S’ leader sequence that is identi- 
cal to the 5’-end of the viral genome (Zuniga et al. 2004; Sawicki 
et al. 2007). These sg mRNAs contain a different number of open 
reading frames (ORFs) that encode the viral structural proteins 
and several accessory proteins. With very few exceptions, only 
the 5’-located ORF (which is absent from the next smaller sg 
mRNA) is translated into protein (Fig. 1). 

In HCoV-229E-infected cells, a total of seven major viral 
RNAs are produced. The viral genome is also referred to as mRNA 
1 because it has an MRNA function. In its 5’-terminal region, the 
genome RNA contains two large ORFs, la and 1b, that encode 
the viral replicase polyproteins 1a and lab. mRNAs 2, 4, 5, 6, 
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and 7 are used to produce the S protein, accessory protein 4, E pro- 
tein, M protein, and N protein, respectively. The 5’-region of 
mRNA 3 contains a truncated fragment of ORF S, which is consid- 
ered defective. Although this sg RNA has been consistently identi- 
fied in HCoV-229E-infected cells, its mRNA function has been 
disputed, and there is currently no evidence that this RNA is trans- 
lated into protein (Schreiber et al. 1989; Raabe et al. 1990; Thiel 
et al. 2003). 

Like many other +RNA viruses, coronaviruses show high rates 
of recombination (Lai 1992; Liao and Lai 1992; Furuya et al. 1993). 
In fact, the mechanism to produce S’ leader-containing sg mRNAs 
represents a prime example for copy-choice RNA recombination 
that, in this particular case, is guided by complex RNA-RNA inter- 
actions involving the transcription-regulating sequence (TRS) core 
sequences and likely requires additional interactions of viral pro- 
teins with specific RNA signals. In other virus systems, RNA recom- 
bination has been shown to generate “transcriptional units” that 
control the expression of individual components of the genome 
(Holmes 2009). The mechanisms involved in viral RNA recombi- 
nation are diverse and may even extend to nonreplicating systems 
(Gallei et al. 2004). In the vast majority of cases, recombination 
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Figure 1. Scheme of genomic and subgenomic (sg) RNAs produced in 
HCoV-229E-infected cells (Raabe et al. 1990; Schreiber et al. 1989). 
Translation of the 5’-terminal ORF(s) of the respective mRNA gives rise to 
the various viral structural and nonstructural proteins (indicated by differ- 
ent colors). mRNA 3 is considered defective and unlikely to be translated 
into protein. Each mRNA has a 3’ poly(A) tail and carries a 5’-leader se- 
quence that is identical to the 5’-end of the genome. In a process called 
discontinuous extension of negative strands, negative-strand RNA synthe- 
sis is attenuated at specific genomic positions. After transfer of the nascent 
strand to an upstream position on the template, RNA synthesis is contin- 
ued by copying the 5’-leader sequence. As a result, the 3’-ends of corona- 
virus minus-strand mRNAs are equipped with the complement of the 5’- 
leader. The latter process is guided by base-pairing interactions between 
the transcription-regulating sequence (TRS) located immediately down- 
stream from the leader sequence (TRS-L) at the 5’-end of the genome 
and the sequence complement of a TRS located upstream of one of the 
ORFs present in the 3’-proximal genome region (TRS-B). 


results in defective RNA (dRNA) copies that lack essential cis-active 
elements and thus cannot be replicated. In other cases, functional 
recombinant RNA with new properties, such as the ability to 
replicate in a new host, may emerge (Chang et al. 1994, 1996; 
Brown et al. 2007; Gustin et al. 2009). In yet other cases, defective 
interfering RNAs (DI-RNAs) may be produced. These defective 
(sg-length) RNAs contain all the cis-acting elements required for 
efficient replication by a helper virus polymerase and, therefore, 
represent parasitic RNAs that compete for components of the viral 
replication/transcription complex with nondefective viral RNAs 
(Pathak and Nagy 2009). 

To elucidate the many facets of recombination and to deter- 
mine full-length haplotypes of, for example, virus mutants/ 
variants in complex viral populations (quasispecies), long-read se- 
quencing has become the method of choice. Short-read second- 
generation sequencing technologies—such as Ion Torrent and 
Illumina—are restricted by read length (200-400 nt) (Hélzer and 
Marz 2017). For example, the use of highly fragmented viral 
RNAs considerably complicates the investigation of haplotypes 
(Nowak 1992; Baaijens et al. 2017). Because the nested coronavirus 
mRNAs are almost identical to the original genome sequence, 
short-read data can usually not be unambiguously assigned to spe- 
cific sg RNA species. 

In this study, we performed direct RNA sequencing (DRS) 
on an array of nanopores, as developed by Oxford Nanopore 
Technologies (ONT) (Garalde et al. 2018). Nanopore sequencing 
does not have a limited reading length but is limited only by frag- 
mentation of the input material (Mikheyev and Tin 2014; Chua 
and Ng 2016; Jain et al. 2016). Further, by using DRS, we avoid sev- 
eral drawbacks of previous sequencing methods, in particular 
cDNA synthesis and amplification of the input material. Thus, 
for example, cDNA synthesis can create artificial RNA-RNA chi- 
merae (Karst et al. 2018) that are difficult to discriminate from 
naturally occurring chimerae (such as spliced RNAs). Also, amplifi- 
cation before sequencing would remove all RNA modifications 
from the input material, whereas the nanopore sequencing tech- 
nology preserves these modifications (Smith et al. 2017; Garalde 
et al. 2018). 


Recently, nanopore sequencing has been used for metage- 
nomic forays into the virosphere (Warwick-Dugdale et al. 2019) 
and studies focusing on transmission routes (Quick et al. 2016; 
Faria et al. 2017). Furthermore, viral transcriptomes have been in- 
vestigated using nanopore sequencing of cDNA (Moldovan et al. 
2017, 2018a,b; Tombacz et al. 2017), being subject to bias from 
reverse transcription and amplification. Other studies used DRS 
to study the human poly(A) transcriptome (Workman et al. 
2018) and the transcriptome of DNA viruses such as HSV 
(Depledge et al. 2018). Furthermore, the genome of influenza A vi- 
rus has been completely sequenced in its original form using DRS 
(Keller et al. 2018). 

In the present study, we sequenced one of the largest known 
RNA genomes, that of HCoV-229E, a member of the genus 
Alphacoronavirus, with a genome size of ~27,300 nt, in order to as- 
sess the complex architectural details for viral sg RNAs produced in 
cells infected with recombinant HCoV-229E. By using DRS, we aim 
to capture complete viral mRNAs, including the full coronavirus 
genome, in single contiguous reads. Sequence analysis of thou- 
sands of full-length sg RNAs will allow us to determine the archi- 
tectures (including leader-body junction sites) of the major viral 
mRNAs. In addition, this approach provides insight into the diver- 
sity of additional HCoV-229E sg RNAs, probably including DI- 
RNAs. Further, we aim to assess whether RNA modifications can 
be called directly from the raw nanopore signal of viral molecules 
without prior in vitro treatment, as has been shown for DNA 
(Stoiber et al. 2016; McIntyre et al. 2019). 


Results 


Full-genome sequencing without amplification 


We sequenced total RNA samples obtained from Huh7 cells in- 
fected with serially passaged recombinant human coronaviruses: 
wild-type (WT) HCoV-229E, HCoV-229E_SL2-SARS-CoV, and 
HCoV-229E_SL2-BCoV, respectively. In the latter two viruses, a 
conserved stem-loop structure (SL2) residing in the HCoV-229E 
5’ UTR was replaced with the equivalent SL2 element from SARS- 
CoV and BCoV, respectively (Madhugiri et al. 2018). Total RNA 
samples obtained for the latter two (chimeric) viruses were pooled 
before sequence analysis. Hereafter, we refer to the first sample as 
WT RNA and to the second (pooled) sample as SL2 RNA (see 
Methods). 

We performed two DRS runs (one per sample) on a MinION 
Nanopore sequencer. As shown in Table 1, we achieved a through- 
put of 0.237 and 0.282 Gb with 225,000 and 181,000 reads for 
the WT and SL2 sample, respectively. See Supplemental Figure 
S1A for an overview of the read length distribution. For the WT 
and SL2 samples, 33.3% and 35.9% of the reads mapped to the ref- 
erence HCoV-229E sequences, respectively; 15.8% and 10.2%, re- 
spectively, mapped to the yeast enolase 2 mRNA sequence, a 
calibration strand added during the library preparation, whereas 
47.4% and 52.7% could be attributed to human host cell RNA. 
Minimap2 (Li 2018) did not align the remaining 3.50% and 
1.11% of reads. Using BLAST (Altschul et al. 1990) against the nu- 
cleotide database, 18.1% and 20.7% of these reads can be attribut- 
ed again to HCoV, human or yeast. As reads that were not aligned 
by minimap2 were mostly very short (median <200 nt), of poor 
basecalling quality and represented only 0.62% and 0.15% of total 
nucleotides, respectively, we decided to only use the higher quality 
reads that minimap2 could align (for detailed statistics, see 
Supplemental Fig. $2). 
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Table 1. Sequencing and error statistics 


No. of reads % % % % % 
Sample Subset (% reads) nucleotides Longest Median subst insert. deletions errors 
WT Complete sample 224,724 (100.0) 100.00 26,210 826 _ _ — — 
HCoV-229E reference 74,783 (33.3) 42.52 26,210 1414 4.292 2.558 8.264 15.114 
Mapped to __H. sapiens 106,618 (47.4) 46.37 9562 816 4.333 2.676 8.572 15.581 
5. cerevisiae ENO2 35,454 (15.8) 10.50 3482 636 3.752 2.384 6.359 12.494 
Unmapped 7869 (3.5) 0.62 1157 186 — _— — _— 
SL2 Complete sample 180,906 (100.0) 100.00 25,885 1342 _ — — — 
HCoV-229E 64,995 (35.9) 48.83 25,885 1626 4.396 2.680 8.507 15.582 
w/SARS-CoV SL2 
Mapped to _—_H. sapiens 95,340 (52.7) 45.44 16,030 1023 4.513 2.783 8.775 16.071 
S. cerevisiae ENO2 18,530 (10.2) 5.58 3872 858 4.021 2.463 6.892 13.376 
Unmapped 2041 (1.1) 0.15 928 200 _ — — — 


Both samples contain mainly HCoV-229E and host (Homo sapiens) transcripts, but also Saccharomyces cerevisiae enolase 2 (ENO2) mRNA reads (which 
was used as a calibration standard added during library preparation). Half of the sequencing errors were deletion errors, probably resulting to a large 
extent from basecalling at homopolymer stretches. The S. cerevisiae enolase 2 mRNA reads display an overall reduced error rate (bold) because the 
Albacore basecaller was trained on this calibration strand. Note that all error rates report differences to the reference genome and thus include actual 


genetic variation. 


The visualized raw voltage signal of a nanopore read is com- 
monly called “squiggle” (see Supplemental Fig. S3). Different 
from all previous sequencing technologies, nanopore sequencing 
preserves the information about base modifications in the raw 
signal (Garalde et al. 2018). However, one of the biggest challenges 
is the accurate mapping of the raw voltage signal to bases 
(“basecalling”). 

As expected for nanopore DRS (Garalde et al. 2018; Keller 
et al. 2018), reads had a median uncorrected error rate of ~15% 
for human and virus reads, whereas basecalling errors were re- 
duced for yeast ENO2 mRNA reads, as the basecaller was trained 
on this calibration strand (see Table 1). This included gaps but 
omitted discontinuous sites >6 nt because they indicated recombi- 
nation. Half of all errors were deletions. In addition, we found that 
more than half of all single-nucleotide deletions occur in homo- 
polymers, and most of these stretches that coincide with a deletion 
are >3 nt long (see Supplemental Fig. S4). A quarter of the errors 
were substitutions, which we argue are largely because of modified 
bases that impede the basecaller’s ability to assign bases correctly. 

The HCoV-229E genome was 99.86% covered, with a large 
coverage bias toward both ends (see Figs. 1, 2). The high coverage 
of the 3’-end reflects the higher abundance of mRNAs produced 
from the 3’-terminal genome regions and is a result of the discon- 
tinuous transcription mechanism used by coronaviruses and sev- 
eral other nidoviruses (Pasternak et al. 2006; Sawicki et al. 2007; 
Sola et al. 2015). The 3’-coverage is further increased by the direc- 
tional sequencing that starts from the mRNA 3’-terminal poly(A) 
tail. Also, the observed coverage bias for the very 5’-end results 
from the coronavirus-specific transcription mechanism because 
all viral mRNAs are equipped with the 65-nt 5’-leader sequence de- 
rived from the 5’-end of the genome. The remainder of the high 5’- 
coverage bias likely reflects the presence of high numbers of DI- 
RNAs in which 5’- and 3’-proximal genomic sequences were fused, 
probably resulting from illegitimate recombination events as 
shown previously for other coronaviruses (Liao and Lai 1992; 
Furuya et al. 1993; Luytjes et al. 1996). For the WT and SL2 sam- 
ples, 38.37% and 16.32% were split-mapped, respectively. Of 
these, only 278 and 181 had multiple splits. The considerably larg- 
er fraction of split reads in the WT sample is explained by the high 
abundance of potential DI-RNA molecules (see Figure 2C). 

An alignment of the longest reads from both samples to the 
HCoV-229E reference indicates that they represent near complete 


virus genomes (Supplemental Fig. S1B). The observed peaks in the 
aligned reads length distribution (see Fig. 4, below) corresponded 
very well with the abundances of the known mRNAs produced 
in HCoV-229E-infected cells (see Fig. 1; Schreiber et al. 1989; 
Raabe et al. 1990; Thiel et al. 2003). Alignment of the reads to these 
canonical mRNA sequences confirmed these observed abundances 
(Supplemental Fig. S5). 

The median read length for the combined set of reads from 
both samples was 826 nt, with a maximum of 26,210 nt, covering 
99.86% of the 27,276-nt-long virus genome, missing only 21 nt at 
the 5’-end, 15 nt at the 3’-end, and those nucleotides that corre- 
spond to the skewed error distribution, with 5.7 percentage points 
more deletions than insertions (see Table 1). The median read 
length might sound short; however, most of the viral RNAs (in- 
cluding many DI-RNAs) identified in HCoV-229E-infected cells 
were <2000 nt in length. Furthermore, this number nearly doubles 
the longest read length that can be obtained with short-read se- 
quencing methods. We observed an abundance of very short reads, 
representing the 3’ (poly[A]) end of the genome. This could be an 
artifact of RNA degradation, although we cannot estimate the ex- 
act fraction of affected transcripts. Because sequencing starts at 
the poly(A) tail, fragmented RNA will not be sequenced beyond 
any 3’ break point. It is thus best to minimize handling time during 
RNA extraction and library preparation. Innovations in these fields 
will directly translate into larger median read lengths. 

We obtained 99.15% and 98.79% identity in both samples 
(WT, SL2), respectively, with the help of the consensus caller 
Ococo (Brinda et al. 2017) using the reference genomes and all 
reads mapping to it. We attempted a standard long-read assembly 
using Canu (Koren et al. 2017), which yielded unusable results 
(WT: 389 contigs, longest 13 kb, all other <4 kb; SL2: 517 contigs, 
all <6 kb). We think that current nanopore-only assembly tools are 
not equipped to handle special read data sets such as those origi- 
nating from a small RNA virus genome. In addition, we assembled 
WT and SL2 consensus sequences using Nanopore and Illumina 
data with HG-CoLoR (Morisse et al. 2018) in an approach that 
uses long nanopore reads to traverse an assembly graph construct- 
ed from short Illumina reads. We thereby recovered 99.57% of the 
reference genome in a single contiguous sequence at 99.90% se- 
quence identity to reference using this approach with the single 
longest read from the SL2 sample. This hybrid approach illustrates 
how short- and long-read technologies can be combined to 
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Figure 2. Reference genome coverage of the HCoV-229E WT sample (blue) and the SL2 sample (orange) based on alignments with minimap2. There is 
an inverse correlation between sg RNA abundance and length. (A) Notable vertical “steps” in the coverage correspond to borders expected for the canon- 
ical sg RNAs (see Fig. 1). (B) The presence of the leader sequence (~65 nt) in canonical sg RNAs gives rise to the sharp coverage peak at the 5’-end. (C) We 
also observed unexpected “steps,” especially in the WT sample (blue). We hypothesize that the sequences correspond to DI-RNA molecules that may arise 
by recombination at TRS-like sequence motifs as well as other sites displaying sequence similarities that are sufficient to support illegitimate recombination 
events (see Fig. 3). We attribute the difference in the observed (noncanonical) recombination sites between the two samples to biological factors that we 


either did not control for or do not know (see also legend to Fig. 3). 


reconstruct long transcripts accurately, which will greatly facilitate 
studies of haplotypes. 


Uncharacterized sg and DI-RNAs 


In addition to the leader-to-body junctions expected for the ca- 
nonical sg mRNAs 2, 4, 5, 6, and 7, we observed a high number 
of recombination sites (Fig. 3) that were consistently found in 
our samples but have not been described previously (Fig. 3). In 
this study, we defined a recombination site as any site that flanks 
more than 100 consecutive gaps, as determined in a discontinuous 
mapping (“spliced” mapping). Although there is currently no con- 
sensus on how to define such sites, we believe this to be a conser- 
vative definition, as this type of pattern is unlikely to result from, 
for example, miscalled homopolymer runs that, in our experience, 
typically affect less than 10 consecutive bases. We observe all 
known canonical HCoV-229E mRNAs at their expected lengths, 
including the (presumably) noncoding mRNA 3 (Fig. 4). 

The aligned reads distribution revealed clusters for all known 
mRNAs which closely fit the expected molecule lengths (Fig. 4). 
The cluster positions show double peaks with a consistent distance 
of ~65 nt, that is, the length of the leader sequence. We observed 
that the 5’-end of reads has larger-than-average error rates and is of- 
ten missing nucleotides (for detailed statistics, see Supplemental 
Fig. S8). This might be because of a bias of the basecaller toward 
the end of reads. This is plausible because the underlying classifi- 
cation algorithm is a bidirectional (i.e., forward and backward 
looking) long-short-term memory neural network (LSTM). The 
mapping algorithm was often unable to align these erroneous 5’- 
ends, leading to soft-clipped bases. Thus, for many reads represent- 
ing canonical mRNAs, the included leader sequence was not 
aligned, which gives rise to the secondary peak at each cluster po- 
sition. We also observed additional clusters that likely correspond 
to highly abundant dRNAs (Fig. 4). 

We also observed several unexpected recombination sites, for 
example, at positions 3000-4000 (within ORF 1a, see Fig. 3). These 


sites were confirmed by both Nanopore and Illumina sequencing. 
They had a high read support and defined margins, suggesting a 
specific synthesis/amplification of these sg RNAs that, most likely, 
represent DI-RNAs. Because DI-RNAs are byproducts of viral repli- 
cation and transcription, they present a larger diversity than the 
canonical viral mRNAs (Penzes et al. 1994, 1996; Méndez et al. 
1996; Brian and Spaan 1997; Izeta et al. 1999). 

Nanopore sequencing captures recombination events far bet- 
ter than Illumina, which allowed us to identify even complex sg 
RNAs (composed of sequences derived from more than two non- 
contiguous genome regions) at much higher resolution: For exam- 
ple, we found sg RNAs with up to four recombination sites in the 
S’- and 3/-terminal genome regions (Fig. 3). 


Consistent 5mC methylation signatures of coronavirus RNA 


Nanopore sequencing preserves information on nucleotide modi- 
fications. By using a trained model, DNA and RNA modifications 
such as 5mC methylation could be identified (Fig. 5). To assess 
the false-positive rate (FPR) of the methylation calling, we used 
an unmethylated RNA calibration standard (RCS) as a negative 
control that was added in the standard library preparation protocol 
for DRS. We considered a position to be methylated if at least 90% 
of the reads showed a methylation signal for this particular posi- 
tion. By using this threshold, the estimated FPR was calculated to 
be <5%. Our experimental setup did not include a positive meth- 
ylation control. 

When analyzing SmC methylation across various RNAs, we 
observed consistent patterns (Fig. 5) that were reproducible for 
the corresponding genomic positions in different RNAs, suggest- 
ing that the methylation of coronavirus RNAs is sequence-specific 
and/or controlled by RNA structural elements. Methylated nucleo- 
tides could be identified across the genome, both in the leader se- 
quence and in the body regions of viral mRNAs. 

Although the overall methylation pattern looks similar 
between sg RNAs and the negative control (see Supplemental 


4 Genome Research 
www.genome.org 


Downloaded from genome.cshlp.org on August 23, 2019 - Published by Cold Spring Harbor Laboratory Press 


Nanopore long and direct RNA-seq of coronaviruses 


uo09s 


yuo080z 
100r9 


7200nt 


14400nt 42g00nt 


13600nt 


HCoV-229E 


Figure 3. Joining of noncontiguous genome sequences in sg RNAs identified in HCoV-229E-infected cells. On the circular axis, the annotations of the 
reference genome (including 5’ UTR [5U] and 3’ UTR [3U]) are shown. Genomic positions of “discontinuous sites” identified in Illumina reads (A; outer 
track), nanopore reads of sample HCoV-229E WT (B; middle track), and nanopore reads of SL2 sample (CG; inner track) reveal multiple recombination sites 
across the whole genome. An aggregation of recombination sites can be observed in the region that encodes the viral N protein. Furthermore, clear re- 
combination sites can be seen at intergenic boundaries and at the 5’ and 3’ UTRs, with the former corresponding to the boundary between the leader 
sequence and the rest of the genome. Another prominent cluster can be observed in ORF1a in the WT nanopore sample but not in SL2. This cluster is 
supported by the WT Illumina data, excluding sequencing bias as a potential source of error. We hypothesize that because samples WT and SL2 were ob- 
tained from nonplaque-purified serially passaged virus populations derived from in vitro transcribed genome RNAs transfected into Huh-7 cells, differences 
in the proportion of full-length transcripts versus abortive transcripts could translate into different patterns of recombination. Generally, nanopore-based 
sequencing allows more detailed analysis of recombination events owing to the long read length. Even complex isoforms such as two exemplary reads, 
each with four discontinuous segments, can be observed (blue and pink). 


As indicated above, only 12% of the sg RNAs were found to 
conform to our current understanding of discontinuous mRNA 
transcription in coronaviruses, resulting in mRNAs that (all) carry 
an identical 5’-leader sequence that is fused to the 3’-coding 
(“body”) sequence of the respective mRNA. We, however, believe 
that 12% represents an underestimate because a large number of 


Fig. S9), we nevertheless find consistent methylation across differ- 
ent sg RNA “types”; that is, methylated positions of mRNA 2 are 
mirrored in mRNA 4 etc. 


Discussion 


We identified highly diverse sg RNAs in coronavirus-infected cells, 
with many sg RNAs not corresponding to the known canonical 
mRNAs. These “noncanonical” sg RNAs had abundant read sup- 
port, and full-length sequences could be obtained for most of these 
RNAs. 


sg RNAs were probably omitted from the analysis: First, RNA mol- 
ecules degrade rapidly under laboratory conditions, even when 
handled carefully. The resulting fragments will only be sequenced 
if they contain a poly(A) tail. Second, the high sequencing error 
may introduce mismappings, especially for low-quality reads. 
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Figure 4. Distribution of aligned read lengths up to 8000 nt for both samples based on alignments with minimap2. We observed clusters that correspond 
well with the transcript lengths expected for canonical mRNAs (vertical dotted lines). Alignment of the reads to these canonical mRNA sequences confirmed 
these observed abundances (Supplemental Fig. $5). The distribution shows double peaks at the cluster positions because reads corresponding to mRNAs 
often miss the leader sequence, possibly owing to basecalling or mapping errors. We also observed additional clusters that likely correspond to highly abun- 
dant DI-RNAs. (A) A cluster in the WT sample that represents chimeric ~820-nt sg RNAs composed of both 5’- and 3’-terminal genome regions (~540 and 
~280 nt, respectively). We propose the template switch for this transcript occurs around position 27,000 and RNA synthesis resumes at around position 540 
(see corresponding peaks in Fig. 3, track B). (B) A cluster from the SL2 sample that contains transcripts with an approximate length of 5150 nt, which rep- 
resents mRNA 3 (see Fig. 1). These transcripts are probably formed owing to a transcription stop at a TRS motif around position 22,150 (see corresponding 


peak in Fig. 3, track C). 


These reads would not be assigned to the canonical model under 
our assumptions because of the high number of mismatches. 
However, we think the associated bias is low, because minimap2 
is very robust against high error rates and because the reads are 
very long, thus ensuring that the mapper has sufficient aggregate 
information on a given read to position it very reliably on a refer- 
ence. Third, the library preparation protocol for DRS includes the 
ligation of adapters via a T4 ligase. Any ligase could potentially in- 
troduce artificial chimera, although we did not investigate this sys- 
tematically. Again, we think that this does not affect our results 
substantially: (1) This bias is random, and it seems unlikely that 
we would observe the very same RNA “isoform” many times if it 
was created by random ligation; (2) many “isoforms” that we ob- 
served only once (e.g., those colored pink and blue in Fig. 3) 
were structured plausibly: They contained a leader sequence and 
had recombined at expected (self-similar) sites corresponding to 
putative or validated TRSs, with downstream sequences being ar- 
ranged in a linear 5’-3’-order. Fourth, finally, it is important to 
note that the RNA used for DRS was isolated from cells infected 
with a serially passaged pool of recombinant viruses rescued after 
transfection of in vitro transcribed genome-length (27.3-kb) 
RNAs. Transfection of preparations of in vitro transcribed RNA of 
this large size likely included a significant proportion of abortive 
transcripts that lacked varying parts of the 3’ genome regions, 
rendering them dysfunctional. It is reasonable to suggest that 
the presence of replication-incompetent RNAs lacking essential 
3’-terminal genome regions may have triggered recombination 
events resulting in the emergence of DI-RNAs that contained all 
the 5’- and 3’-terminal cis-active elements required for RNA replica- 
tion but lacked nonessential internal genome regions. Upon serial 
passaging of the cell culture supernatants for 21 times, DI-RNAs 
may have been enriched, especially in the HCoV-229E (WT) sam- 


ple (Fig. 3). Comparative DRS analyses of RNA obtained from 
cells infected with (1) plaque-purified HCoV-229E and (2) newly 
rescued recombinant HCoV-229E (without prior plaque purifica- 
tion), respectively, would help to address the possible role of pre- 
maturely terminated in vitro transcripts produced from full- 
length cDNA in triggering the large number of DI-RNAs observed 
in our study. 

Although, for the above reasons, the low percentage of ca- 
nonical mRNAs (12%) in our samples likely represents an underes- 
timate, our study may stimulate additional studies, for example, to 
revisit the production of mRNAs from noncanonical templates 
(Wu and Brian 2010; Sola et al. 2011). Also, it is worth mentioning 
in this context that, for several other nidoviruses, such as murine 
hepatitis virus (MHV), bovine coronavirus (BCoV), and arterivi- 
ruses, evidence has been obtained that sg RNA transcription may 
also involve noncanonical TRS motifs joo and Makino 1992; Lai 
and Cavanagh 1997; Lai 1998; Ozdarendeli et al. 2001; Alonso 
et al. 2002). 

The majority of sg RNAs (other than mRNAs) we found in our 
samples likely represent DI-RNAs, which are acommon occurrence 
in coronavirus in vitro studies (Pathak and Nagy 2009). 

To our knowledge, this study is the first to perform RNA mod- 
ification calling without prior treatment of the input sample. It 
only relies on the raw nanopore signal. Although DNA modifica- 
tions such as SmC methylation have been explored extensively 
(Breiling and Lyko 2015), less is known about RNA modifications 
(Roundtree et al. 2017), the importance of which is debated 
(Grozhik and Jaffrey 2017). We found consistent 5mC methyla- 
tion patterns across viral RNAs when tested at a FPR <5%. We 
were not able to assess the sensitivity and specificity of the meth- 
ylation calling owing to the absence of a positive control group, 
which was beyond the scope of this study. 
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Figure 5. 5mC methylation of various annotated coronavirus mRNAs. 
The transcripts have been aligned such that corresponding genomic posi- 
tions can be found in the same vertical column across facets. Both the lead- 
er sequence (including the TRS) and the nested sg sequences show 
consistent patterns of methylation across transcripts. Exemplary positions 
that display consistent methylation across all investigated transcripts are 
marked as red vertical lines. Note that although coronavirus recombina- 
tion uses two TRSs, the resulting transcript has only one TRS, because of 
self-similarity-based pairing of the TRS. Positions have been labeled “meth- 
ylated” if at least 90% of the reads show a methylation signal. Using this 
threshold, the estimated FPR is <5%. 


RNA is known to have many different modifications, and we 
expect the presence of these on coronavirus sg RNAs (Machnicka 
et al. 2013) too. However, to our knowledge no comprehensive 
data exist on prior expectations for such modifications in corona- 
viruses, which might or might not correspond to those observed in 
for example, humans. 

In addition, we observed that the software used (Stoiber et al. 
2016) will likely present high error rates in regions of low coverage 
or where the underlying reference assembly is erroneous. This is 
because the resquiggle algorithm—upon which this method is 
based—has to align the raw nanopore read signal to the basecalled 
read sequence (see Supplemental Fig. S6). This is necessary to test 
the raw signal against learned modification models, of which at 
the time of manuscript preparation (May 2019) only S5mC was 
implemented for RNA. Nevertheless, new options to call these 
modifications at an acceptable error rate without any RNA pretreat- 
ment is a powerful method. 

The validity of the methylation signal should be confirmed 
in future studies using, for example, bisulfite sequencing. Ideally, 
this validation should start from in vitro synthetic transcripts in 
which modified bases have been inserted in known positions. 
Furthermore, RNA modification detection from single-molecule 
sequencing is a current bioinformatic frontier, and algorithms 
and tools are under active development. We showed that consis- 
tent SmC methylation patterns were seen across different sg 
RNAs. However, the overall pattern of the methylation calls be- 
tween sg RNAs and the negative control was very similar. At a 


FPR of 5%, the RNA modifications we identified are supported by 
their consistent occurrence. However, we cannot rule out that 
instead, the observed pattern might be caused by an alignment 
artifact. In the used methylation calling algorithm, the raw signal 
is aligned to the nucleotide sequence after basecalling. If there is a 
systematic bias in this alignment and certain sequence motifs 
cause a consistent mapping mismatch, this mismatch could lead 
to false-positive methylated sites. This is because in these positions 
the signal would deviate from the expected one owing to the 
misalignment and not owing to methylation. In future experi- 
ments, this can be decided using a positive control in the form 
of an RNA transcript with known 5mC methylated sites. However, 
even if we are in its early stages, the reading of RNA modifications 
from the read signal has great potential to elucidate viral biology. 

We were able to reconstruct accurate consensus sequences, 
for both the lumina and Nanopore data. We also showed that in- 
dividual transcripts can be characterized. More problematic was 
the resolution of quasispecies in our experimental setup. 
Although DRS allowed us to confirm the presence of each of the 
two heterologous SL2 structures present in the SL2 sample, this 
was only possible for sg-length (DI-) RNAs. It appears that the 
high error rate of >10% was a critical limitation when analyzing 
the SL2 region located at the extreme 5’-terminal end of the 
27.3-kb genome RNA. This high error rate made variant calling dif- 
ficult, particularly under low-coverage conditions. The current 
generation of long-read assemblers is not well suited to reconstruct 
many viral genome architectures, such as nested ones. The devel- 
opment of specialized assemblers would be of great help in virolo- 
gy projects. 

We used a hybrid error correction method (HG-CoLoR) 
(Morisse et al. 2018) that uses Illumina data to correct read-level er- 
rors. However, it remains questionable whether the corrected read 
sequence is truly representative of the ground truth read sequence. 
Signal-based correction methods such as Nanopolish (Loman et al. 
2015) may be more promising; however at the time of manuscript 
writing (May 2019), correction on direct RNA data has not been 
implemented. We expect this to become available in the near fu- 
ture. Combined with the ever-increasing accuracy of the nanopore 
technology, we think this method might be able to study quasispe- 
cies soon. 

There are recombination events observed in the Illumina data 
that were not detected in the nanopore data. These are likely 
caused by misalignment of the short single-end reads (SO nt). A 
minimum of only 10 nt was required for mapping on either end 
of the gapped alignment. This was a trade-off between sensitivity 
to identify recombination sites and unspecific mapping. 

In this work, we showed the potential of long-read data as 
produced by nanopore sequencing. We were able to directly se- 
quence the RNA molecules of two different samples of one of the 
largest RNA virus genomes known to date. We showed how very 
large RNA genomes and a diverse set of sg RNAs with complex 
structures can be investigated at high resolution without the 
need for a prior assembly step and without the bias introduced 
by cDNA synthesis that is typically required for transcriptome 
studies. 

The detail and quality of the available data still require signif- 
icant bioinformatic expertise as the available tooling is still at an 
early development stage. However, the technological potential of 
nanopore sequencing for new insights into different aspects of vi- 
ral replication and evolution is very promising. 

Future studies should investigate both strands of the corona- 
virus transcriptome. Studies focusing on RNA modifications need 
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to use well-defined positive and negative controls to assess the er- 
ror rate of the current software alternatives. Also, the DRS method 
will be extremely powerful if it comes to analyzing the nature and 
dynamics of specific haplotypes in coronavirus populations under 
specific selection pressures, for example, mutations and/or drugs 
affecting replication efficiency and others. 

Our work also serves as a proof of concept showing that con- 
sistent RNA modifications can be detected using nanopore DRS. 

To fully exploit the potential of DRS, several improvements 
are needed: First and foremost, a significant reduction of the cur- 
rently very high per-read error rate is crucial. This is especially 
problematic in studies focusing on intra-sample heterogeneity 
and haplotypes. Second, protocols that limit RNA degradation dur- 
ing library preparation would be of great value. This could be 
achieved by shortening the library protocol. To limit the cost of 
DRS, barcoded adapters would be desirable. On the bioinformatics 
side, the basecaller for DRS data is still at an early stage and, for ex- 
ample, cannot accurately call the poly(A) regions as well as the 
RNA-DNA-hybrid adapter sequences. Further basecalling errors 
likely result from RNA modifications, which need to be modeled 
more accurately. However, once these limitations will be fixed, 
the use of nanopore-based DRS can be expected to greatly advance 
our understanding of the genomics of virus populations and their 
multiple haplotypes. 


Methods 


RNA virus samples 


The two total RNA samples used in this study for DRS (ONT 
MinION) and [lumina sequencing were prepared at 24 h post 
infection from Huh7 cells infected at an MOI of three with recom- 
binant HCoV-229E WT, HCoV-229E_SL2-SARS-CoV, and HCoV- 
229E_SL2-BCoV, respectively (Madhugiri et al. 2018). Before 
sequence analysis, the two RNA samples obtained from HCoV- 
229E_SL2-SARS-CoV-infected and HCoV-229E_SL2-BCoV-infect- 
ed cells were pooled (SL2 sample) (see Supplemental Fig. S7). 

Generation of recombinant viruses and total RNA isolation 
were performed as described previously (Madhugiri et al. 2018). 
Briefly, full-length cDNA copies of the genomes of HCoV-229E 
(GenBank accession number NC_002645), HCoV-229E_SL2- 
SARS-CoV, and HCoV-229E_SL2-BCoV, respectively, were engi- 
neered into recombinant vaccinia viruses using previously de- 
scribed methods (Thiel et al. 2001; Hertzig et al. 2004; Thiel and 
Siddell 2005). Next, full-length genomic RNAs of HCoV-229E, 
HCoV-229E_SL2-SARS-CoV, and HCoV-229E_ SL2-BCoV, respec- 
tively, were transcribed in vitro using purified Clal-digested geno- 
mic DNA of the corresponding recombinant vaccinia virus as a 
template; 1.5 pg of full-length viral gnome RNA, along with 
0.75 pg of in vitro transcribed HCoV-229E nucleocapsid protein 
mRNA, was used to transfect 1x10° Huh7 cells using the 
TransIT-mRNA transfection kit according to the manufacturer’s in- 
structions (Mirus Bio). At 72 h post transfection (p.t.), cell culture 
supernatants were collected and serially passaged in Huh7 cells for 
21 (WT) or 12 times (HCoV-229E_SL2-SARS-CoV and HCoV- 
229E_SL2-BCoV), respectively. 


Nanopore sequencing and long-read assessment 


For nanopore sequencing, 1 pg of RNA in 9 pL was carried into the 
library preparation with the Oxford Nanopore DRS protocol (SQK- 
RNAO01). All steps were followed according to the manufacturer’s 
specifications. The library was then loaded on an R9.4 flow cell and 


sequenced on a MinION device (ONT). The sequencing run was 
terminated after 48 h. 

The raw signal data were basecalled using Albacore (v2.2.7; 
available through the Oxford Nanopore community forum). 

Although it is customary to remove adapters after DNA se- 
quencing experiments, we did not perform this preprocessing 
step. The reason is that the sequenced RNA is attached to the 
adapter molecule via a DNA linker, effectively creating a DNA- 
RNA chimera. The current basecaller, being trained on RNA, is 
not able to reliably translate the DNA part of the sequence into 
base space, which makes adapter trimming based on sequence dis- 
tance unreliable. However, we found that the subsequent mapping 
is very robust against these adapter sequences. All mappings were 
performed with minimap2 (v2.8-r672) (Li 2018) using the 
“spliced” preset without observing the canonical GULAG splicing 
motif (parameter -u n), and k-mer size set to 14 (-k 14). 

Raw reads coverage and sequence identity to the HCoV-229E 
reference genome (WT: GenBank, NC_002645.1; SL2: stem loop 2 
sequence replaced with SARS-CoV SL2 sequence) were determined 
from mappings to the references produced by minimap2. Read 
origin and sequencing error statistics were assessed by mapping 
the reads simultaneously with minimap2 to a concatenated 
mock-genome consisting of HCoV-229E (WT and SL2 variants, re- 
spectively), yeast enolase 2 mRNA (calibration strand, GenBank, 
NP_012044.1), and the human genome (GRCh38). Identity and 
error rates are the number of matching nucleotides (or number 
of nucleotide substitutions, insertions, or deletions) divided by 
the total length of the alignment including gaps from indels. 

Consensus calling of nanopore reads was performed with 
Ococo (v0.1.2.6) (Brinda et al. 2017). The minimum required 
base quality was set to zero in order to avoid gaps in low coverage 
domains. 

We used the hybrid error correction tool HG-CoLoR (Morisse 
et al. 2018) in conjunction with the Illumina HiSeq short-read data 
sets of both samples to reduce errors in all reads that exceed 20,000 
nt in length. The program builds a de Bruijn graph from the near 
noise-free short-read data and then substitutes fragments of the 
noisy long reads with paths found in the graph that correspond 
to that same fragment of the sequence. HG-CoLoR was run with 
default parameters except for the maximum order of the de 
Bruijn graph, which was set to 50 in order to fit the length of the 
short reads. 


Illumina HiSeq sequencing and assembly 


Illumina short-read sequencing was performed using the TruSeq 
RNA v2 kit to obtain RNA from species with poly(A) tails and with- 
out any strand information. The three samples (WT, SL2_SARS- 
CoV, SL2_BCoV) selected for this study were prepared on a HiSeq 
2500 lane and sequenced with 51 cycles. After demultiplexing, 
23.2, 22.0, and 23.8 million single-end reads were obtained for 
the WT and the two SL2 samples, respectively. The raw sequencing 
data were deposited at the OSF (doi:10.17605/OSF.IO/UP7B4) and 
ENA (PRJEB33797). 


Characterization of transcript isoforms and sg RNAs 


We first defined TRS as an 8-mer with a maximum Hamming dis- 
tance of two from the motif UCUCAACU. We then searched the 
HCoV-229E reference genome (GenBank, NC_002645.1) for all 
matching 8-mers. We then synthesized sg RNAs in silico as follows: 
For each pair of complementary 8-mers (5’-TRS, 3’-TRS), we accept- 
ed at most one mismatch to simulate base-pairing under a stable 
energy state. We then joined two reference subsequences for 
each pair or TRS: first, the 5’-end up to but not including the 5’- 
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TRS and, second, the 3’-end of the reference genome including the 
3’-TRS and excluding the poly(A) tail. 

This way we obtained about 5000 candidate sg RNAs. To val- 
idate them, we mapped the nanopore reads to these “mock” sg 
RNAs in a nondiscontinuous manner; that is, all reads had to 
map consecutively without large gaps. To count as a putative hit, 
95% of the read length had to uniquely map to a given mock tran- 
script, and the mock transcript could not be >5% longer than the 
read. 

We only considered putative hits as plausible if they had a 
read support of at least five. With this threshold, we aim to balance 
the sensitivity of finding plausible novel transcripts with a need to 
control the number of false positives. 


Identification of 5mC methylation 


We used Tombo (v1.3, default parameters) (Stoiber et al. 2016) to 
identify signal level changes corresponding to SmC methylation 
(see Supplemental Fig. S6). 

To assess the FPR of the methylation calling, we used an RCS 
as a negative control. It is added in the standard library preparation 
protocol for DRS. This mRNA standard is derived from the yeast 
enolase II (YHR174W) gene (Engel et al. 2014) and is produced us- 
ing an in vitro transcription system. As a consequence, the mRNA 
standard is not methylated. 

For a conservative resquiggle match score of 1.3 (part of the 
Tombo algorithm, default setting for RNA) and a methylation 
threshold of 0.9, the FPR was 4.67%, which met our requirement 
that the FPR be <5%. Our experimental setup did not include a pos- 
itive methylation control. 


Data access 


Both the short-read cDNA (Illumina) and the basecalled long-read 
RNA (ONT) data from this study have been submitted to the 
European Nucleotide Archive (ENA; http://www.ebi.ac.uk/ena) 
under accession number PRJEB33797, as well as to the Open 
Science Framework repository UP7B4. Raw long-read data in 
fastS format have also been submitted to this OSF repository. All 
analysis code has been submitted to the same OSF repository 
and is also available in Supplemental_Code.zip from the 
Supplemental Material. 
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